{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### Copyright 2019 Google LLC"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Licensed under the Apache License, Version 2.0 (the \"License\");\n",
    "# you may not use this file except in compliance with the License.\n",
    "# You may obtain a copy of the License at\n",
    "#\n",
    "# https://www.apache.org/licenses/LICENSE-2.0\n",
    "#\n",
    "# Unless required by applicable law or agreed to in writing, software\n",
    "# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
    "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
    "# See the License for the specific language governing permissions and\n",
    "# limitations under the License."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "# scikit-learn Training on AI Platform\n",
    "This notebook uses the [Census Income Data Set](https://archive.ics.uci.edu/ml/datasets/Census+Income) to demonstrate how to train a model on Ai Platform."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "# How to bring your model to AI Platform\n",
    "Getting your model ready for training can be done in 3 steps:\n",
    "1. Create your python model file\n",
    "    1. Add code to download your data from [Google Cloud Storage](https://cloud.google.com/storage) so that AI Platform can use it\n",
    "    1. Add code to export and save the model to [Google Cloud Storage](https://cloud.google.com/storage) once AI Platform finishes training the model\n",
    "1. Prepare a package\n",
    "1. Submit the training job"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "# Prerequisites\n",
    "Before you jump in, let’s cover some of the different tools you’ll be using to get online prediction up and running on AI Platform. \n",
    "\n",
    "[Google Cloud Platform](https://cloud.google.com/) lets you build and host applications and websites, store data, and analyze data on Google's scalable infrastructure.\n",
    "\n",
    "[AI Platform](https://cloud.google.com/ml-engine/) is a managed service that enables you to easily build machine learning models that work on any type of data, of any size.\n",
    "\n",
    "[Google Cloud Storage](https://cloud.google.com/storage/) (GCS) is a unified object storage for developers and enterprises, from live data serving to data analytics/ML to data archiving.\n",
    "\n",
    "[Cloud SDK](https://cloud.google.com/sdk/) is a command line tool which allows you to interact with Google Cloud products. In order to run this notebook, make sure that Cloud SDK is [installed](https://cloud.google.com/sdk/downloads) in the same environment as your Jupyter kernel.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "# Part 0: Setup\n",
    "* [Create a project on GCP](https://cloud.google.com/resource-manager/docs/creating-managing-projects)\n",
    "* [Create a Google Cloud Storage Bucket](https://cloud.google.com/storage/docs/quickstart-console)\n",
    "* [Enable AI Platform Training and Prediction and Compute Engine APIs](https://console.cloud.google.com/flows/enableapi?apiid=ml.googleapis.com,compute_component&_ga=2.217405014.1312742076.1516128282-1417583630.1516128282)\n",
    "* [Install Cloud SDK](https://cloud.google.com/sdk/downloads)\n",
    "* [Install scikit-learn](http://scikit-learn.org/stable/install.html) [Optional: used if running locally]\n",
    "* [Install pandas](https://pandas.pydata.org/pandas-docs/stable/install.html) [Optional: used if running locally]\n",
    "\n",
    "These variables will be needed for the following steps.\n",
    "* `TRAINER_PACKAGE_PATH <./census_training>` - A packaged training application that will be staged in a Google Cloud Storage location. The model file created below is placed inside this package path.\n",
    "* `MAIN_TRAINER_MODULE <census_training.train>` - Tells AI Platform which file to execute. This is formatted as follows <folder_name.python_file_name>\n",
    "* `JOB_DIR <gs://$BUCKET_NAME/scikit_learn_job_dir>` - The path to a Google Cloud Storage location to use for job output.\n",
    "* `RUNTIME_VERSION <1.9>` - The version of AI Platform to use for the job. If you don't specify a runtime version, the training service uses the default AI Platform runtime version 1.0. [See the list of runtime versions for more information](https://cloud.google.com/ml-engine/docs/tensorflow/runtime-version-list).\n",
    "* `PYTHON_VERSION <3.5>` - The Python version to use for the job. Python 3.5 is available with runtime version 1.4 or greater. If you don't specify a Python version, the training service uses Python 2.7.\n",
    "\n",
    "** Replace: **\n",
    "* `PROJECT_ID <YOUR_PROJECT_ID>` - with your project's id. Use the PROJECT_ID that matches your Google Cloud Platform project.\n",
    "* `BUCKET_NAME <YOUR_BUCKET_NAME>` - with the bucket id you created above.\n",
    "* `JOB_DIR <gs://YOUR_BUCKET_NAME/scikit_learn_job_dir>` - with the bucket id you created above.\n",
    "* `REGION <REGION>` - select a region from [here](https://cloud.google.com/ml-engine/docs/regions) or use the default '`us-central1`'. The region is where the model will be deployed."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "env: PROJECT_ID=<PROJECT_ID>\n",
      "env: BUCKET_NAME=<BUCKET_NAME>\n",
      "env: REGION=us-central1\n",
      "env: TRAINER_PACKAGE_PATH=./census_training\n",
      "env: MAIN_TRAINER_MODULE=census_training.train\n",
      "env: JOB_DIR=gs://<BUCKET_NAME>/scikit_learn_job_dir\n",
      "env: RUNTIME_VERSION=1.9\n",
      "env: PYTHON_VERSION=3.5\n"
     ]
    }
   ],
   "source": [
    "%env PROJECT_ID <PROJECT_ID>\n",
    "%env BUCKET_NAME <BUCKET_NAME>\n",
    "%env REGION us-central1\n",
    "%env TRAINER_PACKAGE_PATH ./census_training\n",
    "%env MAIN_TRAINER_MODULE census_training.train\n",
    "%env JOB_DIR gs://<BUCKET_NAME>/scikit_learn_job_dir\n",
    "%env RUNTIME_VERSION 1.9\n",
    "%env PYTHON_VERSION 3.5\n",
    "! mkdir census_training"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "##  The data\n",
    "The [Census Income Data Set](https://archive.ics.uci.edu/ml/datasets/Census+Income) that this sample\n",
    "uses for training is provided by the [UC Irvine Machine Learning\n",
    "Repository](https://archive.ics.uci.edu/ml/datasets/). We have hosted the data on a public GCS bucket `gs://cloud-samples-data/ml-engine/sklearn/census_data/`. \n",
    "\n",
    " * Training file is `adult.data`\n",
    " * Evaluation file is `adult.test` (not used in this notebook)\n",
    "\n",
    "Note: Your typical development process with your own data would require you to upload your data to GCS so that AI Platform can access that data. However, in this case, we have put the data on GCS to avoid the steps of having you download the data from UC Irvine and then upload the data to GCS."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "### Disclaimer\n",
    "This dataset is provided by a third party. Google provides no representation,\n",
    "warranty, or other guarantees about the validity or any other aspects of this dataset."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "# Part 1: Create your python model file\n",
    "\n",
    "First, we'll create the python model file (provided below) that we'll upload to AI Platform. This is similar to your normal process for creating a scikit-learn model. However, there are two key differences:\n",
    "1. Downloading the data from GCS at the start of your file, so that AI Platform can access the data.\n",
    "1. Exporting/saving the model to GCS at the end of your file, so that you can use it for predictions.\n",
    "\n",
    "The code in this file loads the data into a pandas DataFrame that can be used by scikit-learn. Then the model is fit against the training data. Lastly, sklearn's built in version of joblib is used to save the model to a file that can be uploaded to [AI Platform's prediction service](https://cloud.google.com/ml-engine/docs/scikit/getting-predictions#deploy_models_and_versions).\n",
    "\n",
    "**REPLACE Line 18: BUCKET_NAME = '<BUCKET_NAME>' with your GCS BUCKET_NAME**\n",
    "\n",
    "Note: In normal practice you would want to test your model locally on a small dataset to ensure that it works, before using it with your larger dataset on AI Platform. This avoids wasted time and costs."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Writing ./census_training/train.py\n"
     ]
    }
   ],
   "source": [
    "%%writefile ./census_training/train.py\n",
    "# [START setup]\n",
    "import datetime\n",
    "import pandas as pd\n",
    "\n",
    "from google.cloud import storage\n",
    "\n",
    "from sklearn.ensemble import RandomForestClassifier\n",
    "from sklearn.externals import joblib\n",
    "from sklearn.feature_selection import SelectKBest\n",
    "from sklearn.pipeline import FeatureUnion\n",
    "from sklearn.pipeline import Pipeline\n",
    "from sklearn.preprocessing import LabelBinarizer\n",
    "\n",
    "\n",
    "# TODO: REPLACE '<BUCKET_NAME>' with your GCS BUCKET_NAME\n",
    "BUCKET_NAME = '<BUCKET_NAME>'\n",
    "# [END setup]\n",
    "\n",
    "\n",
    "# ---------------------------------------\n",
    "# 1. Add code to download the data from GCS (in this case, using the publicly hosted data).\n",
    "# AI Platform will then be able to use the data when training your model.\n",
    "# ---------------------------------------\n",
    "# [START download-data]\n",
    "# Public bucket holding the census data\n",
    "bucket = storage.Client().bucket('cloud-samples-data')\n",
    "\n",
    "# Path to the data inside the public bucket\n",
    "blob = bucket.blob('ml-engine/sklearn/census_data/adult.data')\n",
    "# Download the data\n",
    "blob.download_to_filename('adult.data')\n",
    "# [END download-data]\n",
    "\n",
    "\n",
    "# ---------------------------------------\n",
    "# This is where your model code would go. Below is an example model using the census dataset.\n",
    "# ---------------------------------------\n",
    "# [START define-and-load-data]\n",
    "# Define the format of your input data including unused columns (These are the columns from the census data files)\n",
    "COLUMNS = (\n",
    "    'age',\n",
    "    'workclass',\n",
    "    'fnlwgt',\n",
    "    'education',\n",
    "    'education-num',\n",
    "    'marital-status',\n",
    "    'occupation',\n",
    "    'relationship',\n",
    "    'race',\n",
    "    'sex',\n",
    "    'capital-gain',\n",
    "    'capital-loss',\n",
    "    'hours-per-week',\n",
    "    'native-country',\n",
    "    'income-level'\n",
    ")\n",
    "\n",
    "# Categorical columns are columns that need to be turned into a numerical value to be used by scikit-learn\n",
    "CATEGORICAL_COLUMNS = (\n",
    "    'workclass',\n",
    "    'education',\n",
    "    'marital-status',\n",
    "    'occupation',\n",
    "    'relationship',\n",
    "    'race',\n",
    "    'sex',\n",
    "    'native-country'\n",
    ")\n",
    "\n",
    "\n",
    "# Load the training census dataset\n",
    "with open('./adult.data', 'r') as train_data:\n",
    "    raw_training_data = pd.read_csv(train_data, header=None, names=COLUMNS)\n",
    "\n",
    "# Remove the column we are trying to predict ('income-level') from our features list\n",
    "# Convert the Dataframe to a lists of lists\n",
    "train_features = raw_training_data.drop('income-level', axis=1).values.tolist()\n",
    "# Create our training labels list, convert the Dataframe to a lists of lists\n",
    "train_labels = (raw_training_data['income-level'] == ' >50K').values.tolist()\n",
    "# [END define-and-load-data]\n",
    "\n",
    "\n",
    "# [START categorical-feature-conversion]\n",
    "# Since the census data set has categorical features, we need to convert\n",
    "# them to numerical values. We'll use a list of pipelines to convert each\n",
    "# categorical column and then use FeatureUnion to combine them before calling\n",
    "# the RandomForestClassifier.\n",
    "categorical_pipelines = []\n",
    "\n",
    "# Each categorical column needs to be extracted individually and converted to a numerical value.\n",
    "# To do this, each categorical column will use a pipeline that extracts one feature column via\n",
    "# SelectKBest(k=1) and a LabelBinarizer() to convert the categorical value to a numerical one.\n",
    "# A scores array (created below) will select and extract the feature column. The scores array is\n",
    "# created by iterating over the COLUMNS and checking if it is a CATEGORICAL_COLUMN.\n",
    "for i, col in enumerate(COLUMNS[:-1]):\n",
    "    if col in CATEGORICAL_COLUMNS:\n",
    "        # Create a scores array to get the individual categorical column.\n",
    "        # Example:\n",
    "        #  data = [39, 'State-gov', 77516, 'Bachelors', 13, 'Never-married', 'Adm-clerical', \n",
    "        #         'Not-in-family', 'White', 'Male', 2174, 0, 40, 'United-States']\n",
    "        #  scores = [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]\n",
    "        #\n",
    "        # Returns: [['State-gov']]\n",
    "        # Build the scores array\n",
    "        scores = [0] * len(COLUMNS[:-1])\n",
    "        # This column is the categorical column we want to extract.\n",
    "        scores[i] = 1\n",
    "        skb = SelectKBest(k=1)\n",
    "        skb.scores_ = scores\n",
    "        # Convert the categorical column to a numerical value\n",
    "        lbn = LabelBinarizer()\n",
    "        r = skb.transform(train_features)\n",
    "        lbn.fit(r)\n",
    "        # Create the pipeline to extract the categorical feature\n",
    "        categorical_pipelines.append(\n",
    "            ('categorical-{}'.format(i), Pipeline([\n",
    "                ('SKB-{}'.format(i), skb),\n",
    "                ('LBN-{}'.format(i), lbn)])))\n",
    "# [END categorical-feature-conversion]\n",
    "\n",
    "# [START create-pipeline]\n",
    "# Create pipeline to extract the numerical features\n",
    "skb = SelectKBest(k=6)\n",
    "# From COLUMNS use the features that are numerical\n",
    "skb.scores_ = [1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0]\n",
    "categorical_pipelines.append(('numerical', skb))\n",
    "\n",
    "# Combine all the features using FeatureUnion\n",
    "preprocess = FeatureUnion(categorical_pipelines)\n",
    "\n",
    "# Create the classifier\n",
    "classifier = RandomForestClassifier()\n",
    "\n",
    "# Transform the features and fit them to the classifier\n",
    "classifier.fit(preprocess.transform(train_features), train_labels)\n",
    "\n",
    "# Create the overall model as a single pipeline\n",
    "pipeline = Pipeline([\n",
    "    ('union', preprocess),\n",
    "    ('classifier', classifier)\n",
    "])\n",
    "# [END create-pipeline]\n",
    "\n",
    "\n",
    "# ---------------------------------------\n",
    "# 2. Export and save the model to GCS\n",
    "# ---------------------------------------\n",
    "# [START export-to-gcs]\n",
    "# Export the model to a file\n",
    "model = 'model.joblib'\n",
    "joblib.dump(pipeline, model)\n",
    "\n",
    "# Upload the model to GCS\n",
    "bucket = storage.Client().bucket(BUCKET_NAME)\n",
    "blob = bucket.blob('{}/{}'.format(\n",
    "    datetime.datetime.now().strftime('census_%Y%m%d_%H%M%S'),\n",
    "    model))\n",
    "blob.upload_from_filename(model)\n",
    "# [END export-to-gcs]\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "# Part 2: Create Trainer Package\n",
    "Before you can run your trainer application with AI Platform, your code and any dependencies must be placed in a Google Cloud Storage location that your Google Cloud Platform project can access. You can find more info [here](https://cloud.google.com/ml-engine/docs/tensorflow/packaging-trainer)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Writing ./census_training/__init__.py\n"
     ]
    }
   ],
   "source": [
    "%%writefile ./census_training/__init__.py\n",
    "# Note that __init__.py can be an empty file.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "# Part 3: Submit Training Job\n",
    "Next we need to submit the job for training on AI Platform. We'll use gcloud to submit the job which has the following flags:\n",
    "\n",
    "* `job-name` - A name to use for the job (mixed-case letters, numbers, and underscores only, starting with a letter). In this case: `census_training_$(date +\"%Y%m%d_%H%M%S\")`\n",
    "* `job-dir` - The path to a Google Cloud Storage location to use for job output.\n",
    "* `package-path` - A packaged training application that is staged in a Google Cloud Storage location. If you are using the gcloud command-line tool, this step is largely automated.\n",
    "* `module-name` - The name of the main module in your trainer package. The main module is the Python file you call to start the application. If you use the gcloud command to submit your job, specify the main module name in the --module-name argument. Refer to Python Packages to figure out the module name.\n",
    "* `region` - The Google Cloud Compute region where you want your job to run. You should run your training job in the same region as the Cloud Storage bucket that stores your training data. Select a region from [here](https://cloud.google.com/ml-engine/docs/regions) or use the default '`us-central1`'.\n",
    "* `runtime-version` - The version of AI Platform to use for the job. If you don't specify a runtime version, the training service uses the default AI Platform runtime version 1.0. See the list of runtime versions for more information.\n",
    "* `python-version` - The Python version to use for the job. Python 3.5 is available with runtime version 1.4 or greater. If you don't specify a Python version, the training service uses Python 2.7.\n",
    "* `scale-tier` - A scale tier specifying the type of processing cluster to run your job on. This can be the CUSTOM scale tier, in which case you also explicitly specify the number and type of machines to use.\n",
    "\n",
    "Note: Check to make sure gcloud is set to the current PROJECT_ID"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Updated property [core/project].\r\n"
     ]
    }
   ],
   "source": [
    "! gcloud config set project $PROJECT_ID"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "Submit the training job."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true,
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Job [census_training_20180718_083452] submitted successfully.\r\n",
      "Your job is still active. You may view the status of your job with the command\r\n",
      "\r\n",
      "  $ gcloud ml-engine jobs describe census_training_20180718_083452\r\n",
      "\r\n",
      "or continue streaming the logs with the command\r\n",
      "\r\n",
      "  $ gcloud ml-engine jobs stream-logs census_training_20180718_083452\r\n",
      "jobId: census_training_20180718_083452\r\n",
      "state: QUEUED\r\n"
     ]
    }
   ],
   "source": [
    "! gcloud ml-engine jobs submit training census_training_$(date +\"%Y%m%d_%H%M%S\") \\\n",
    "  --job-dir $JOB_DIR \\\n",
    "  --package-path $TRAINER_PACKAGE_PATH \\\n",
    "  --module-name $MAIN_TRAINER_MODULE \\\n",
    "  --region $REGION \\\n",
    "  --runtime-version=$RUNTIME_VERSION \\\n",
    "  --python-version=$PYTHON_VERSION \\\n",
    "  --scale-tier BASIC"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "# [Optional] StackDriver Logging\n",
    "You can view the logs for your training job:\n",
    "1. Go to https://console.cloud.google.com/\n",
    "1. Select \"Logging\" in left-hand pane\n",
    "1. Select \"Cloud ML Job\" resource from the drop-down\n",
    "1. In filter by prefix, use the value of $JOB_NAME to view the logs"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "# [Optional] Verify Model File in GCS\n",
    "View the contents of the destination model folder to verify that model file has indeed been uploaded to GCS.\n",
    "\n",
    "Note: The model can take a few minutes to train and show up in GCS."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "! gsutil ls gs://$BUCKET_NAME/census_*"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "# Next Steps:\n",
    "The AI Platform online prediction service manages computing resources in the cloud to run your models. Check out the [documentation pages](https://cloud.google.com/ml-engine/docs/scikit/) that describe the process to get online predictions from these exported models using AI Platform."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
